Analyse des performances de modèles de langage sub-lexicale pour des langues peu-dotées à morphologie riche
Abstract
Performance analysis of sub-word language modeling for under-resourced languages with rich morphology: case study on Swahili and Amharic

This paper investigates the impact of sub-word units on ASR performance for two under-resourced African languages with rich morphology (Amharic and Swahili). Two sub-word units are considered: the syllable and the morpheme, the latter being obtained in either a supervised or an unsupervised way. The important issue of reconstructing words from the syllable (or morpheme) ASR output is also discussed. For both languages, the best results are reached with morphemes obtained by the unsupervised approach. This yields a very significant WER reduction for Amharic ASR, for which the LM training data is very small (2.3M words), and it also slightly reduces WER relative to a word-LM baseline for Swahili ASR (28M words for LM training). A detailed analysis of OOV word reconstruction is also presented; it shows that a high percentage of OOV words (up to 75% for Amharic) can be recovered with a morph-based language model and an appropriate reconstruction method.

Keywords: Language model, Morpheme, Out-of-vocabulary, Under-resourced languages.
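To make the word-reconstruction step more concrete, here is a minimal Python sketch. It is not the authors' method, which the abstract does not describe in detail. It assumes that each non-final morph in the ASR output carries a trailing "+" marker, a common convention but an assumption here. The function names `reconstruct_words` and `oov_recovery_rate` are hypothetical, and the recovery measure is a simplified presence check rather than an aligned scoring.

```python
def reconstruct_words(morphs, marker="+"):
    """Join morphs into words.

    Assumption: a morph ending with `marker` must be glued to the next morph.
    """
    words, current = [], ""
    for morph in morphs:
        if morph.endswith(marker):
            # Non-final morph: strip the marker and keep accumulating.
            current += morph[: -len(marker)]
        else:
            # Word-final morph: close the current word.
            words.append(current + morph)
            current = ""
    if current:
        # Dangling non-final morph at the end of the utterance.
        words.append(current)
    return words


def oov_recovery_rate(reference_words, hypothesis_words, word_vocab):
    """Fraction of reference words missing from `word_vocab` (OOV words)
    that nevertheless appear in the reconstructed hypothesis.

    Simplification: presence anywhere in the hypothesis counts as recovered.
    """
    oov = [w for w in reference_words if w not in word_vocab]
    if not oov:
        return 0.0
    hypothesis = set(hypothesis_words)
    return sum(w in hypothesis for w in oov) / len(oov)


if __name__ == "__main__":
    # Illustrative Swahili example: wa+ na+ soma -> wanasoma ("they are reading").
    morph_output = ["wa+", "na+", "soma", "kitabu"]
    words = reconstruct_words(morph_output)
    print(words)
    print(oov_recovery_rate(["wanasoma", "kitabu"], words, {"kitabu"}))
```

In this toy example the word "wanasoma" is absent from the word-level vocabulary but is rebuilt from three in-vocabulary morphs, which is the mechanism that lets a morph-based language model recover OOV words.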
Publication date: 2012